


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


1981-03 


Investigation of parameters affecting voice 
recognition systems in C3 systems. 


Batchellor, Mary Pamela 


Monterey, California. Naval Postgraduate School 


http://ndl.handle.net/10945/20575 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for
research materials and institutional publications created by the NPS community.
Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first
appointed -- and published -- scholarly author.

Dudley Knox Library / Naval Postgraduate School

411 Dyer Road / 1 University Circle 
Monterey, California USA 93943 





http://www.nps.edu/library 





INVESTIGATION OF PARAMETERS AFFECTING 
VOICE RECOGNITION SYSTEMS IN C3 SYSTEMS 


Mary Pamela Batchellor 





NAVAL POSTGRADUATE SCHOOL 


Monterey, California 





THESIS


INVESTIGATION OF PARAMETERS AFFECTING 
VOICE RECOGNITION SYSTEMS IN C3 SYSTEMS


by 


Mary Pamela Batchellor 


March 1981


Thesis Advisor: G. K. Poock 





Approved for public release; distribution unlimited. 





Unclassified
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

TITLE (and Subtitle): Investigation of Parameters Affecting Voice Recognition Systems in C3 Systems
AUTHOR(s): Mary Pamela Batchellor
PERFORMING ORGANIZATION NAME AND ADDRESS: Naval Postgraduate School, Monterey, California 93940
SECURITY CLASS. (of this report): Unclassified
DISTRIBUTION STATEMENT (of this Report): Approved for public release; distribution unlimited.
KEY WORDS: VTAG; Voice Data Entry; Automatic Word Recognition; Voice Recognition


ABSTRACT


This research investigates the use of a voice recognition
system by military operators -- officer, enlisted, male and
female. The application intended is the use of a discrete
utterance voice recognition system in a command center
environment. The system would be used by members of a watch
team to execute ad hoc queries against an automated data base
in support of their command center duties. The following
factors were examined:

-- the adaptability of a random sample of active duty
military personnel to a voice input system.
-- the accuracy of such a system.
-- the effects of male versus female operators.
-- the effects of officer versus enlisted operators.
-- the advantages/disadvantages of using three, five
or ten training passes to train the voice system.

Results showed no significant difference in error rates
between the categories of officer and enlisted nor between
male and female. Three training passes had a slightly
higher error rate than five or ten passes but five and ten
passes were the same.





TABLE OF CONTENTS

I. INTRODUCTION
   A. BACKGROUND
      1. Voice Technology
      2. Command, Control and Communications
   B. OBJECTIVES
II. METHOD
   A. DESIGN
   B. SUBJECTS
   C. EQUIPMENT
   D. PROCEDURE
   E. DEPENDENT VARIABLES
III. ANALYSIS AND RESULTS
   A. HYPOTHESES
   B. RESULTS FOR SEX
   C. RESULTS FOR RANK -- OFFICER VS. ENLISTED
   D. RESULTS FOR NUMBER OF TRAINING PASSES -- THREE, FIVE OR TEN
   E. RESULTS FOR NUMBER OF UTTERANCE SYLLABLES
IV. DISCUSSION AND CONCLUSIONS
APPENDIX A. SUBJECT QUESTIONNAIRE AND ANSWER SHEET
APPENDIX B. INSTRUCTIONS TO SUBJECTS
APPENDIX C. VOCABULARY
APPENDIX D. CONFUSION MATRIX
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST



LIST OF FIGURES

1. Conceptual Design of Experiment
2. Equipment Set-Up
3. Graph for Errors vs. Sex
4. Graph for Errors vs. Rank
5. Graph for Number of Training Passes vs. Rank
6. Graph for Number of Training Passes vs. Sex
7. Graph for Number of Training Passes vs. Order of Training
8. Graph for Errors vs. Number of Syllables for Three Training Passes
9. Graph for Errors vs. Number of Syllables for Five Training Passes
10. Graph for Errors vs. Number of Syllables for Ten Training Passes





ACKNOWLEDGEMENTS 


I wish to acknowledge the patient instruction of my
thesis advisor, Professor G. K. Poock, who has made this
thesis research an enjoyable educational experience. Also,
I would like to acknowledge the time and assistance given
by my second reader, Professor D. E. Neil, and a fellow
student, Captain J. W. Armstrong. Finally, I would like
to thank the staff members and students of the Naval
Postgraduate School who generously volunteered their time
to be subjects for my research.







I. INTRODUCTION 


A. BACKGROUND 
1. Voice Technology
"It is only a matter of time until automatic speech
recognition (ASR) becomes a major force in man-machine
communication because of the inherent advantages of
speech communication and our increasing need to communicate
with machines. The inherent advantages of speech
arise from its universality, convenience, and speed."
[Ref. 1].

Speech is the human's fastest and most convenient
method of communicating and consequently little or no
operator training is required if speech is used as the interface
between man and computer. In experiments involving
speech and other forms of machine communication (e.g.,
typing), information is exchanged almost twice as fast with
speech [Ref. 2]. In addition to the speed and ease of
training, speech input frees the operators' hands and eyes
for other tasks [Ref. 3].

The use of voice input to machines can be categorized 
into three modes of operation: 

-- voice response. 
-- speaker verification. 
-- speech recognition. 

VOICE RESPONSE is the area of voice input which deals
with speech synthesis -- voice readout of computer-stored
data. The appropriate message is selected from a stored
vocabulary by a synthesis program and then given to a
synthesizer device which generates a signal for transmission
over a voice circuit [Ref. 4].

SPEAKER VERIFICATION involves authenticating the
identity of a speaker according to measurements on his voice
signal. Applications for speaker verification systems
include voice lock/unlock security systems and banking and
credit transactions [Ref. 5].

SPEECH RECOGNITION is giving commands to machines 
by voice. The machine does not have to identify the speaker, 
only "recognize" what is said. The commands can be given 
by any speaker as long as his or her voice patterns match 
those parameters for the desired stored command. Speech 
recognition systems are used for baggage and parcel sorting, 
quality control on production lines and voice direction of 
machine tools. They are typified by small word vocabularies 
spoken by a small population of users or large vocabularies 
(several hundred words) for speakers who allow the machine 
to calibrate their voices [Ref. 6]. 

The first experiments with speech input to machines
were done in the 1950's using vowel and digit recognition
systems. Today there are commercially available isolated
word recognition systems which easily handle small vocabularies
from a known set of speakers. Actual systems in use today
include United Air Lines' baggage handling system, Ford
Motor Company's assembly line inspection of cars, Union
Carbide's nuclear products manipulation system at Oak Ridge
and Lockheed's quality control inspection line in Sunnyvale,
California.

There are two features which characterize the
complexity of the speech recognition task:

-- whether the speech is connected or spoken one word at
a time.
-- the size of the vocabulary.

In connected speech the acoustic characteristics of sounds
and words have greater variability. In addition, it is
difficult to determine where one word ends and the next
begins. As the number of words in the vocabulary and the
number of different contextual variations per word increase,
the storage required to store all reference patterns becomes
enormous.

The principal difficulty in automatic speech recog- 
nition is not due to a lack of speech understanding but to 
the massive amount of memory and time required to store and 
process the required data. Recent progress has been limited 
more by advances in data processing than in speech recognition 
technology [Ref. 7]. 

Therefore, a major disadvantage of speech recognition
systems is the requirement for large amounts of memory and
processing time. Some additional problems are:

-- speaker variability due to sex and dialect makes
recognition very difficult.
-- speech communication is not private.
-- speech communication may be subject to environmental
noise and distortions.
-- voice input is expensive in comparison to other
input/output devices. (The cost of voice input
devices ranges from $200 to $80,000 which includes
a wide variety of capabilities.)

In spite of these restrictions, applications for
voice systems today include several areas:

a. voice readout of numerals.
(1) telephone numbers.
(2) assembly of equipment.
(3) stock price quotations.
(4) inventory reporting.
(5) automatic directory assistance.

b. industrial applications.
(1) special purpose computer programming for machine
tools.
(2) quality control inspection systems.
(3) equipment handling and sorting systems.

c. editing of financial information.

This thesis will address another application for today's
voice recognition systems -- that of command and control. The
implication here is not command and control in the sense of
voice communication with machines but in the military application
of a management information system which provides
data on resources available.

2. Command, Control and Communications (C3)

In 1972 the Honeywell 6000 computer (H6000) was 
installed at Commander in Chief Naval Forces Europe 
(CINCUSNAVEUR) in support of the World Wide Military 
Command and Control System (WWMCCS). The H6000 transferred 
CINCUSNAVEUR from the first generation of computer systems -- 
characterized by card decks and single job processing -- to 
the third generation of multiprogramming, timesharing and 
terminal input/output. What existed at CINCUSNAVEUR in
the way of "computer support" prior to the H6000 was a very
"user unfriendly" AN/UYK computer which required a great
deal of expertise and very specific procedures to operate.

Consequently, when the H6000 was installed, the staff, 
conditioned by the difficulties of using the prior data 
processing equipment, was very reluctant to have a computer 
replace their filing cabinets. After several years of 
software changes, updates to the Navy WWMCCS Software 
Standardization System (NWSS) were being passed from the 
fleet by AUTODIN to the H6000. Messages were not manually 
manipulated unless they were kicked out of the system because 
of errors. 

In spite of the fact that inputs to the database were
being electrically transmitted from AUTODIN to the H6000
before the communication center could distribute the paper
copy, the staff, for the most part, avoided the NWSS query
module and held to their filing cabinets. Training sessions
given by the software developers on how to use NWSS were not
well attended. User reaction to the system was so negative
that a separate shop for monitoring the database and correcting
the error messages had to be formed using ADP resources.
That is, the users who were supposed to be responsible for
data content passed the responsibility off to the data
processors.

In 1978, a preliminary evaluation of the man-machine
interface of the NWSS query module was done by Naval Ocean
Systems Center [Ref. 8]. The reason for the study was to
investigate the possibility of simplifying the query module
since the module, while it is very powerful, is also rather
confusing to the infrequent user. There are nonstructured
query systems being tested on data bases similar to NWSS --
LADDER, for example -- which would provide the user with
much easier access to the data. LADDER (Language Access
to Distributed Data Bases with Error Recovery) will allow a
user to ask the computer a question in plain English ("Where
is the Kennedy?") instead of requiring a specific format and
specific command words. The free format LADDER query system
has been in test and development status since 1977.

But let's take it a step further. Even if a
relatively free format query system was available from NWSS,
chances are a good percentage of the staff would still not
be interested -- because it still requires the user to sit
in front of a terminal and find characters which are
randomly spread over the keyboard. (Would Star Trek ever
have been so popular if Captain Kirk had to wheel up to
a keyboard and begin typing instead of just facing the panel
and speaking into it?) If using the NWSS query module was
as easy as loading a tape of voice patterns and "speaking"
the query to the computer, would there be less reluctance
on the part of the staff and command center team to use the
automated data base instead of going to the files?

The problem of C3 today is significantly more complex
than at any time in the past. To be competitive in today's
automated world, some extension of man's memory and computational
abilities is needed. How can this capability be
provided without requiring an excessive amount of training?
Is it possible to provide a computer tool without requiring
typing skills to use it?

The easier it is to access the data, the more likely
the staffer will be to use it. The easiest way for a
nondata-processor to interface with a computer is simply to talk to
it. Consideration for the use of a voice interface with the
automated information system would include such questions
as:

Is it feasible to utilize a voice recognition system
in an environment such as a command center where each
member of the watch team could query the computer by
voice?

Is it cost effective to train a military member
to use a voice recognition system and could it be
done in a negligible amount of time?

Would voice input in terms of today's technology
be adaptable for female as well as male usage?

What are the tradeoffs in using three, five or
ten training passes in terms of training time, error
rates and user psychology?

Would it be feasible in terms of system resources
to store voice patterns for every member of the watch
section on the computer?

Would stress vary the voice patterns to such an
extent that the voice input system would be unacceptable
in the varying stress situations of the command center
environment?

With these thoughts in mind, this thesis investigates the
use of a voice recognition system by military operators --
male, female, officer, enlisted -- from technical and
non-technical backgrounds.


B. OBJECTIVES
The objective of this thesis was to explore the use of
a voice recognition system by a random sample of active duty
military personnel. Specifically, the aim was to determine the
effectiveness of such a system in each of the following three cases:

1. Male operators versus female operators:

The female voice generally has a higher pitch than
the male voice due to the spread of the harmonics in
the frequency spectrum of the female voice. This factor
causes problems in frequency resolution and consequently
the female voice has been particularly hard
for machines to recognize [Ref. 9]. There has been
very little work done with female subjects and voice
recognition systems. Any system to be used in a
command center environment will more than likely have
female as well as male operators. Thus, one of the
main objectives of this study was to compare the
error rates of the machine using operators of
both sexes.

2. Officer operators versus enlisted operators:

Another group of subjects that has had little
documented experience with the voice recognition
system is that of enlisted personnel. Seemingly,
there should be no difference between officer and
enlisted. However, this assumption has not been
tested. The likely candidate for use of the voice
recognition system in the command center environment
would be the enlisted member of the watch team.
(Hopefully, the ease of use introduced by voice access
would change this!) The emphasis in this study was
on the use of operational personnel. The intent was
to be realistic in the experience levels of the
proposed operators in order to provide a true picture
of the adaptability of the operators to the equipment
and the training required for them to use the
equipment.

3. Three, five, or ten training passes to train the
voice recognition system:

The accepted algorithm used to train the voice
recognition system in this experiment requires ten
training passes to "learn" to recognize the operator's
utterance. In an extensive vocabulary this can demand
a considerable amount of time and can conceivably
introduce errors in the training process if boredom
and/or fatigue take over. There is an algorithm
available to train using five or three utterances as
well as ten. The final area examined was the use
of three or five training passes vice ten.


II. METHOD

A. DESIGN

Figure 1 shows the conceptual design for this experiment.
It is a three-way nested hierarchical analysis of variance.
Each of the four groups -- male enlisted, male officer,
female enlisted, female officer -- consists of ten subjects.
Each subject trained and tested the voice recognition system
using three, five and ten training passes in a random order.


B. SUBJECTS

Forty active duty military volunteers participated in
this study. There were ten female officers, ten female
enlisted, ten male officers and ten male enlisted.

The enlisted subjects were all Navy members stationed
at the Naval Postgraduate School. Their ranks ranged from
E1 to E8. Their rates were: Religious Program Specialist,
Yeoman, Personnelman, Mess Management Specialist, Intelligence
Specialist, Data Processor, Storekeeper, Air Intercept
Controller, Electronics Technician (including fire control
specialist).

The officers were from three U.S. services -- Navy, Army,
Air Force -- and the Canadian Forces. They ranged in grade
from O3 to O5. All but two were NPS students in the C3,
Operations Research, Telecommunications Management,
Intelligence, Personnel Management and Communications
Engineering curricula. The other two were an Army chemical
officer from Fort Ord and an Air Force navigator stationed
at the Joint Chiefs of Staff. The backgrounds of the officers
were: special warfare, National Oceanic and Atmospheric
Administration, ADP, intelligence, telecommunications,
cryptology, acquisition, aviator, aerospace engineering,
management analysis and communications.

Based on a questionnaire given to each subject before
performing the exercise, all but four thought voice input
would be easier and less frustrating than typing as a means
of input to the computer. Sixteen of the forty subjects
had used or seen voice input used but only two had more
than an introduction to voice response systems.


C. EQUIPMENT

The equipment used in this research was a Threshold
Technology, Incorporated, Model T600 discrete utterance
voice recognition system which was located inside an
Industrial Acoustic Company sound reduction chamber. The
microphone used was a Shure SM10 head microphone.

The Model T600 consists of four basic components (see
Figure 2):

-- preprocessor unit consisting of an analog speech
preprocessor and a digital input/output interface.
-- operator console/microphone preamplifier.
-- tape cartridge unit.
-- CRT display and console.

The preprocessor accepts the speech from the microphone
preamplifier, extracts speech parameters and converts these
to digital signals which are processed by the microcomputer.
The microcomputer compares the input signals with stored
reference patterns to determine which, if any, of the vocabulary
words were spoken. If a close match is found between
the input speech pattern and one of the reference patterns,
a user defined character string is sent to the user's device
via the output interface. If no match is found the system
emits a "beep" sound.

The reference patterns are generated during the "training
mode" which requires a speaker to give several repetitions
of each utterance with a variety of inflections as would be
used in normal speech. The number of repetitions required
is usually ten but for this experiment additional logic was
added to the T600 to allow the use of three or five repetitions.
An utterance can be a single word ("grid") or group
of words ("command and control") lasting from a tenth of a
second to two seconds. The only requirement is that the
utterance contain no pauses of a tenth of a second or
greater. If a tenth of a second pause is made, the T600
will treat the sound as two utterances instead of the intended
one. Up to 256 utterances are allowed on this system [Ref. 10].

Each utterance processed by the T600 is passed through
nineteen bandpass filters which span the speech spectrum.
The overall signal spectral shape is then described using
a spectral shape detector which calculates the rate of change
of energy level with respect to frequency. The spectral
shape and its changes over time are calculated every two
milliseconds to determine the presence or absence of thirty-two
acoustic features. When the end of the utterance is
detected, the duration of the utterance is divided into
sixteen time segments and reconstructed into a normalized
time base. The T600 extracts a 512-bit feature matrix -- 32
binary features by 16 time segments -- for each version of
an utterance. Then all matrices (three, five or ten) are
combined to produce a single reference matrix for an element.
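
The construction of the 512-bit matrix described above can be sketched in Python as follows. This is an illustration only. The text does not give the T600's rule for time normalization or for combining training passes, so the "feature present in any frame of the segment" rule, the majority vote, and the names normalize and combine are all assumptions.

```python
N_FEATURES = 32   # binary acoustic features (from the text)
N_SEGMENTS = 16   # normalized time segments (from the text)


def normalize(frames):
    """Map a variable-length utterance onto a 16 x 32 bit matrix.

    frames: list of per-frame feature vectors, each a list of 32 values
    (0 or 1). Assumed rule: a feature is set in a segment if it is
    present in any frame that falls into that segment.
    """
    n = len(frames)
    matrix = []
    for s in range(N_SEGMENTS):
        lo = s * n // N_SEGMENTS
        hi = max((s + 1) * n // N_SEGMENTS, lo + 1)  # at least one frame
        segment = frames[lo:hi]
        matrix.append([int(any(f[k] for f in segment))
                       for k in range(N_FEATURES)])
    return matrix


def combine(matrices):
    """Combine several training matrices into one reference matrix.

    Assumed rule: a bit is set in the reference if it is set in more
    than half of the training passes (majority vote).
    """
    passes = len(matrices)
    return [[int(2 * sum(m[s][k] for m in matrices) > passes)
             for k in range(N_FEATURES)]
            for s in range(N_SEGMENTS)]
```

Whatever the real combining rule is, the size of the result is fixed: 32 features by 16 segments gives 512 bits, or 64 bytes per reference pattern, regardless of how long the utterance was or how many passes were used.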

When an utterance is spoken for recognition by the T600
a 512-bit descriptive matrix is calculated and weighted
correlations between this matrix and each reference matrix
describing the vocabulary utterances are calculated. The
vocabulary utterance with the largest correlation exceeding some preset
threshold value is then selected as the utterance spoken.
If no correlation exceeds the preset threshold value the
T600 emits a "beep" sound [Ref. 11].
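
The decision rule in this paragraph can be sketched as below. The weights used by the T600 are not given in the text, so this sketch substitutes a plain unweighted fraction of matching bits for the weighted correlation. The threshold of 0.8 and the function names are likewise hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length bit lists agree.

    Stand-in for the T600's weighted correlation, whose weights are
    not described in the source.
    """
    return sum(x == y for x, y in zip(a, b)) / len(a)


def recognize(pattern, references, threshold=0.8):
    """Return the label of the best-matching reference pattern.

    references: dict mapping utterance label -> 512-bit reference list.
    Returns None (the "beep" case) when no score exceeds the threshold.
    """
    best_label, best_score = None, threshold
    for label, reference in references.items():
        score = similarity(pattern, reference)
        if score > best_score:  # must exceed the threshold and the best so far
            best_label, best_score = label, score
    return best_label


# Two made-up reference patterns for illustration.
references = {
    "grid": [1, 0] * 256,
    "command and control": [0, 1] * 256,
}
```

A pattern that differs from the "grid" reference in only a few bits is returned as "grid". A pattern that matches every reference equally poorly returns None, which corresponds to the beep.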

The T600 has a magnetic tape cartridge unit which allows
the user to build his vocabulary reference patterns and store
them on a tape cartridge. When the subject wants to use the
equipment, the tape is loaded into the preprocessor unit.
This also allows a user to build a vocabulary for different
tasks. He can then load the voice patterns for the task
he needs to execute. Since the operator is not dependent
on any large computer to store his voice patterns, the equipment
can easily be moved and still be operational.


D. PROCEDURE

At the beginning of the session, subjects were given a 
questionnaire regarding their opinions on voice input versus 
manual typing. (See Appendix A.) The objectives of the 
experiment were explained along with an introduction to the 
voice recognition equipment used and the procedure to be
followed. The subject was then seated in a controlled 
acoustical environment chamber in front of a video display 
and given instructions on how to train the equipment. (See 
Appendix B.) 

The vocabulary used in this test consisted of fifty
utterances -- words and phrases -- varying in length from
one to five syllables. The utterances were not chosen to
test the machine's ability to distinguish between similar
sounds -- "get" and "met," for example. The only consideration
in choosing the vocabulary was to have the same number
of utterances in each syllable category -- ten one-syllable
words, ten two-syllable words, etc. The vocabulary list
is shown in Appendix C. Appendix D contains the Confusion
Matrix.

Once the subject was introduced to the experiment and
equipment, the head mike was mounted and the subject began
training the fifty-word vocabulary using either three, five
or ten training passes. The number of training passes used
first was randomly determined so that each would be used
first the same number of times. That is, one-third of the
subjects started out using ten training passes. Another third
used three training passes first and the last third started
out using five training passes.
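
One way to produce such a counterbalanced assignment is sketched below. The thesis does not say how the random assignment was actually carried out, so the procedure, the seed and the name assign_orders are assumptions. With forty subjects an exact three-way split is not possible, so this sketch gives one condition 14 first uses and the other two 13 each.

```python
import random

CONDITIONS = (3, 5, 10)  # number of training passes per condition


def assign_orders(n_subjects, seed=0):
    """Give every subject an order of the three training conditions.

    The first condition is balanced across subjects as evenly as
    possible and then shuffled. The order of the remaining two
    conditions is chosen at random for each subject.
    """
    rng = random.Random(seed)
    firsts = [CONDITIONS[i % len(CONDITIONS)] for i in range(n_subjects)]
    rng.shuffle(firsts)
    orders = []
    for first in firsts:
        rest = [c for c in CONDITIONS if c != first]
        rng.shuffle(rest)
        orders.append((first, *rest))
    return orders


orders = assign_orders(40)
```

Balancing only the first position, as the text describes, controls for the subject being unfamiliar with the equipment on the first run. It does not by itself balance the second and third positions.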

The training procedure involved repeating an utterance 
the required number of times and then testing the equipment 
by repeating the utterance two or three times. If the 
machine did not respond correctly two out of three times 
the utterance was retrained. Once the entire vocabulary 
was trained, the subject tested the equipment by reading 
through the vocabulary list twice (100 utterances). Any 
"beeps" or incorrect responses were noted by the experimenter. 
This entire procedure was repeated using a different number 
of training passes until each subject had trained and tested 
the equipment using three, five and ten training repetitions. 
Subjects were allowed to rest, ask questions or get a drink
at any time during the procedure.


E. DEPENDENT VARIABLES 
After the training session each subject read through the
list of words two times. A record was kept of each time the
machine responded with a "beep" or an incorrect utterance.
A record was also kept of the time each subject took to
complete the experiment.


III. ANALYSIS AND RESULTS


A. HYPOTHESES
The following hypotheses were to be tested:

1. Hypothesis regarding male and female subjects.

H0: "There is no difference between male and female
users of the voice recognition system."
H1: "The null hypothesis is false."

2. Hypothesis regarding officer and enlisted subjects.

H0: "There is no difference between officer and
enlisted users of the voice recognition system."
H1: "The null hypothesis is false."

3. Hypothesis regarding number of training passes.

H0: "There is no difference in recognition accuracy
when a different number of training passes is
used in the voice recognition system."
H1: "The null hypothesis is false."
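
Each null hypothesis above is judged by an F ratio. As a simplified illustration, the sketch below computes a one-way F ratio (between-group mean square over within-group mean square). It is not the three-way nested analysis used in this thesis, and the error percentages in the example are invented for illustration, not taken from the experiment.

```python
def f_ratio(groups):
    """One-way analysis of variance F ratio.

    groups: list of lists of observations, one inner list per group.
    F = (SS_between / (k - 1)) / (SS_within / (n - k)),
    where k is the number of groups and n the total observations.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented per-subject error percentages for two groups.
example = f_ratio([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

The null hypothesis is rejected only when the computed F ratio exceeds the critical value for the chosen alpha level and degrees of freedom. A small ratio means the difference between group means is no larger than the variation within the groups.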


B. RESULTS FOR SEX

The results of this experiment for male and female
subjects are shown graphically in Figure 3. The machine's
performance for men was slightly better than for women -- 1.8%
error rate for men versus 2.1% for women based on twenty
subjects making 6000 utterances in each sex category.
However, the analysis of variance (ANOVA) results in Table I
show an F ratio of .45 which indicates no statistically significant
difference between male and female operators. Thus the null
hypothesis is not rejected. This result speaks highly
for the algorithm used by Threshold. It would appear they
have a good handle on the additional requirements needed
to process the female voice.

[Figure 3. Errors vs. Sex: mean percentage error for male and female subjects.]

[Table I. Analysis of Variance. Sources listed: Total, Between Subjects, Male/Female (.0199). F = F ratio; P = probability of error.]

This result further establishes the possibility of using
a voice recognition system in a command center environment.
The highest probability of error occurred with female subjects
but even then the mean percentage error was only 2.1%. That
is, out of one hundred utterances (an utterance, again,
being a single word or group of words) spoken by a female
watch team member to the computer, all but two or three would be
interpreted correctly. If these utterances were being typed,
a greater probability of error would exist since one
utterance could have as many typing errors as there are
characters in the utterance.
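
As a rough check on the scale of these figures, the reported mean error percentages can be converted into approximate error counts. The sketch below is illustrative only: the function name is hypothetical, and because the published percentages are rounded, the resulting counts are approximate.

```python
def approx_error_count(utterances: int, error_pct: float) -> float:
    """Approximate number of misrecognized utterances implied by an error percentage."""
    return utterances * error_pct / 100.0


UTTERANCES_PER_CATEGORY = 6000  # reported for each sex category

# Reported mean error percentages: 1.8% for men, 2.1% for women.
male_errors = approx_error_count(UTTERANCES_PER_CATEGORY, 1.8)
female_errors = approx_error_count(UTTERANCES_PER_CATEGORY, 2.1)

print(f"men:   about {male_errors:.0f} errors in {UTTERANCES_PER_CATEGORY} utterances")
print(f"women: about {female_errors:.0f} errors in {UTTERANCES_PER_CATEGORY} utterances")
```

The difference between the two categories amounts to fewer than twenty utterances out of 6000, which is consistent with the small F ratio.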


C. RESULTS FOR RANK -- OFFICER VS. ENLISTED

Figure 4 shows the comparison of machine errors for the
two categories of officer and enlisted. The machine's
performance for the enlisted was slightly better than for
officers -- 1.85% versus 2.05% mean error percentage based
on twenty subjects making 6000 utterances in each rank
category.

However, the statistical results from the ANOVA (Table I)
show an F ratio of .42. Therefore, there is no statistically
significant difference in the error rate of the T600 when
used by officer or enlisted personnel. Based on these
statistics, a voice system should be equally usable by
either officer or enlisted members of the watch team.
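
The F ratios quoted for sex and rank compare the variation between group means with the variation among subjects within each group. The sketch below computes a one-way F ratio from scratch as an illustration only: the function name and the per-subject error percentages are hypothetical, and the ANOVA actually used in this study is a more elaborate design than this one-way case.

```python
from statistics import mean


def one_way_f_ratio(groups):
    """Return the one-way ANOVA F ratio for a list of groups of observations."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical per-subject error percentages, NOT data from the experiment.
officer = [2.4, 1.6, 2.9, 1.3, 2.1]
enlisted = [1.9, 2.6, 1.2, 2.2, 1.4]

f_ratio = one_way_f_ratio([officer, enlisted])
print(f"F ratio = {f_ratio:.2f}")
```

An F ratio below 1, like the .45 and .42 reported here, means the group means differ by less than the subject-to-subject variation within the groups.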

D. RESULTS FOR NUMBER OF TRAINING PASSES -- THREE, FIVE OR TEN

Figure 5 shows the relationship between number of 
training passes and rank. Figure 6 shows the relationship 
between number of training passes and sex. In each case 
the percentage of error for training the T600 with five or 
ten training passes is about the same -- around 1% error 
for both ranks and both sexes. However, the percentage 
of error using three training passes is significantly 
higher -- around 2.7% based on rank and 2.4% to 3% based on 
sex. 

This graphical interpretation is confirmed statistically
by the ANOVA at a significance level of .01. That is, the
F ratio is 9.14, which is well above the 4.79 required for
an alpha level of .01. Based on the F ratio, the null
hypothesis is rejected. Therefore, there is a significant
difference in recognition accuracy of the T600 when a
different number of training passes is used. A Duncan Range test
was performed to verify that the difference in performance
was between three training passes and five or ten training
passes. Five and ten passes had about the same probability
of error. Even though three training passes has a
significantly higher percentage of error than five and
ten passes, it is still only a 3% error rate.
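
The decision rule applied to the training-pass effect can be stated compactly. The sketch below uses only the two values reported above; the function and constant names are illustrative, and the critical value is taken from the text rather than computed.

```python
def rejects_null(f_ratio: float, f_critical: float) -> bool:
    """Reject the null hypothesis when the observed F ratio exceeds the critical value."""
    return f_ratio > f_critical


F_TRAINING_PASSES = 9.14    # F ratio reported for number of training passes
F_CRITICAL_ALPHA_01 = 4.79  # critical value reported for an alpha level of .01

if rejects_null(F_TRAINING_PASSES, F_CRITICAL_ALPHA_01):
    print("Null hypothesis rejected at the .01 level")
else:
    print("Null hypothesis not rejected at the .01 level")
```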

The ANOVA also showed a significant interaction (alpha 
level less than .01) between the number of training passes 
used and the rank of the subject. This would imply that an 
enlisted user would have a lower error rate if he trained 
the system using five training passes and an officer user 
would get better recognition if he used ten training passes. 
A t-test was performed to determine if five and ten passes 
for officers and five and ten passes for enlisted were 
indeed different since this interaction seemed unrealistic. 
The t-test showed both t-statistics (.7682 for women officers 
and -1.3125 for enlisted women) were within the 95% acceptance 
region. Therefore, the t-test shows there is no difference
in error rate when using five or ten training passes for 
either officer or enlisted category. 
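
The acceptance-region check can be illustrated without knowing the exact degrees of freedom, which are not restated here. The sketch below is an assumption-laden illustration, not the computation performed in the study: it compares each reported t-statistic with the two-tailed 95% critical value of the standard normal distribution (about 1.96). Student's t critical values at the same level are larger for any finite degrees of freedom, so a statistic inside this narrower bound is also inside the t acceptance region.

```python
from statistics import NormalDist

# Two-tailed 95% critical value of the standard normal distribution.
# This is a lower bound on the corresponding Student's t critical value.
Z_CRITICAL_95 = NormalDist().inv_cdf(0.975)

# t-statistics as reported in the text, labeled as the text labels them.
reported_t = {"women officers": 0.7682, "enlisted women": -1.3125}


def within_acceptance_region(t_stat: float, critical: float) -> bool:
    """True when the statistic lies strictly inside (-critical, +critical)."""
    return abs(t_stat) < critical


for group, t_stat in reported_t.items():
    inside = within_acceptance_region(t_stat, Z_CRITICAL_95)
    print(f"{group}: t = {t_stat:+.4f}, within 95% acceptance region: {inside}")
```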

A possible explanation for enlisted performance being 
lower with ten training passes is that five passes allowed 
enough variation to build a good identity matrix and ten 
training passes invited such a degree of boredom that the 
performance was degraded. 

It is interesting to note that although the manufacturer
recommends ten training passes for the best performance of
the system, the results of this study show no significant
difference between five and ten training passes. This
result might only apply when a relatively small vocabulary
is used, but in a crisis situation this could suggest the
use of five training passes to get a needed vocabulary on
tape quickly. As one's experience with the T600 increases,
the use of fewer training passes may be sufficient.

The order in which subjects trained the equipment with 
the different number of training passes was randomly assigned 
to prevent any biases in case learning or fatigue factors 
were involved. Figure 7 shows the percent error rate versus 
number of training passes used in the order subjects trained. 
That is, for all subjects who started out the experiment
using three training passes, the percent error rate was 2.3%. 
For all subjects who used five training passes first, the 
percent error rate was 2%. Those subjects who used three 
training passes after training with five and ten passes 
had a percent error rate of 2.9%. 

If an improvement due to experience was a factor, then
five training passes was the only condition which demonstrated
this. However, the increase in errors as three training
passes were used second and third could be due to the fact
that subjects became accustomed to putting a lot of
inflections in the utterances and, when only three passes were
used, they ran out of training passes before running out of
inflections. The increase in errors when ten training passes
were used last could easily be explained by the fatigue
factor. Most subjects took twice as long to train the
fifty-word vocabulary using ten training passes as they did using
three passes. By the time they were training and testing
for the third time the novelty had begun to wear off and
voices were getting tired.

A correlation was run on three passes versus five passes,
five versus ten and three versus ten to see if a subject who
performed well on three training passes also did well with five
and ten passes. Only the result of the three-five
correlation, .67, is significant at .05. The five-ten
correlation was .23 and the three-ten correlation was .11. Neither
of these is significantly close to 1 or -1 and, therefore,
little correlation is evident for these two cases.
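
One common way to judge whether a correlation coefficient differs from zero is to convert it to a t statistic. The sketch below is a hedged illustration, not the procedure used in the study: it assumes all forty subjects entered each correlation (the sample size is not stated for this analysis) and takes the two-tailed .05 critical value for 38 degrees of freedom from a standard t table. The function and constant names are hypothetical.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


N_SUBJECTS = 40        # assumption: all forty subjects entered each correlation
T_CRITICAL_05 = 2.024  # two-tailed .05 critical value, 38 degrees of freedom (t table)

# Correlations as reported in the text.
reported_r = {"three-five": 0.67, "five-ten": 0.23, "three-ten": 0.11}

for pair, r in reported_r.items():
    t = t_from_r(r, N_SUBJECTS)
    significant = abs(t) > T_CRITICAL_05
    print(f"{pair}: r = {r:.2f}, t = {t:.2f}, significant at .05: {significant}")
```

Under these assumptions only the three-five correlation clears the critical value, which agrees with the result stated above.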

E. RESULTS FOR NUMBER OF UTTERANCE SYLLABLES -- 1, 2, 3,

Figures 8 through 10 show the recognition error rate
for each number of training passes versus the number of syllables
in the utterance. In Figure 8, using three training passes, 
the T600 misinterpreted one-syllable utterances (words 0 
through 4 and 25 through 29 in Appendix C) 28 times out of 
800 utterances (40 subjects x 10 utterances x 2 repetitions 
for each utterance) for a percentage error rate of 3.5%. 
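
The arithmetic behind this percentage can be written out as a short sketch; the variable names are illustrative, and every value comes from the paragraph above.

```python
SUBJECTS = 40
ONE_SYLLABLE_WORDS = 10  # words 0 through 4 and 25 through 29 in Appendix C
REPETITIONS = 2          # each utterance was spoken twice
MISRECOGNIZED = 28       # one-syllable utterances misinterpreted with three passes

total_utterances = SUBJECTS * ONE_SYLLABLE_WORDS * REPETITIONS
error_pct = 100.0 * MISRECOGNIZED / total_utterances

print(f"{MISRECOGNIZED} errors in {total_utterances} utterances = {error_pct:.1f}% error rate")
```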

With one exception the percentage error rate decreased as
the number of syllables increased for all three training
matrices. This seems reasonable since a greater number of
syllables gives the T600 more unique data to build a
recognition matrix for the utterance. The exception for both
three and five passes is two syllables. That is, the
percentage error rate decreases for utterances from one to
five syllables with the exception of two syllables where
the error rate is greatest. In the case of ten training
passes, the exception is three-syllable utterances, with
one syllable having the greatest error rate.

FIGURE 9. ERRORS VS. NUMBER OF SYLLABLES
FOR FIVE TRAINING PASSES

The percentage error rate for five training passes is
significantly better than three in all syllable categories.
With the exception of two and five syllables it is also
better than ten training passes. The best system performance
was using five-syllable utterances and ten training passes.

IV. DISCUSSION AND CONCLUSIONS


The main points brought out in the previous results
section were that:

1. There was no significant difference in error rates
between the categories of officer and enlisted users of the
voice recognition system.

2. There was no significant difference in error rates
between the categories of female and male users of the system.

3. There was a significant difference in error rates
of all categories when using three training passes
vice five or ten passes but the five and ten training
passes did not differ significantly in error rate.

4. There was significant interaction between rank
and the number of training passes used.


Based on these results there should be no problem,
technically or psychologically, with the use of voice
recognition systems by military men and women, officer
or enlisted. Although this experiment was conducted in
a sound reduction chamber, there are two T600 voice
recognition systems located in the C3 Laboratory at the Naval
Postgraduate School which are frequently in use. The C3
Laboratory simulates the environment of a command center.
There have been no problems with background noise in the
use of this voice system. Professor R. Elster [Ref. 12]
found similar results with his study on The Effects of
Certain Background Noises on the Performance of a Voice
Recognition System.

The enthusiasm and ease with which the subjects used
and trained the equipment are positive signs for the
successful use of voice recognition systems in command centers.
At the time of this writing, a T600 system has been placed
in the command center at Commander in Chief Pacific Fleet
(CINCPACFLT). During the week of 1 December 1980, Dr. Gary
Poock and LT Ellen Roland of the Naval Postgraduate School
faculty gave a demonstration of the T600 voice recognition
system to CINCPACFLT. That staff now has a T600 in the
command center which is being experimented with in a variety
of areas.

APPENDIX A


SUBJECT QUESTIONNAIRE AND ANSWER SHEET 


Please answer the following questions with respect to
your capabilities.

For items 3 - 7 designate your feelings from strong
feeling for manual input (far left box), no strong feeling
either way (middle box), strong feeling for voice input
(far right box).

For items 8 and 9, designate your feelings from strong
feelings in favor (far right box), no strong feelings either
way (middle box), strong feeling against (far left box).


1. Have you ever used voice input? 

2. Have you ever seen voice input used? 

3. Which might be easier, manual typing input or voice 
input for communicating with a computer? 

4. Would you be more relaxed using manual typing input
or voice input?

5. Would you have more flexibility in entering items to a 
computer with voice input or manual typing input? 

6. Would voice input or manual typing allow you more time 
and freedom to do other things? 

7. Would you be more frustrated using voice input or
manual typing?

8. In general, do you like the idea of voice input?

9. In general, do you think you would like to use voice
input in everyday tasks yourself if it were applicable?

SUBSPECIALTY

( ) NPS STUDENT
(ORGANIZATION & JOB TITLE)

1. YES NO
2. YES NO

MANUAL TYPING   NEUTRAL   VOICE INPUT

APPENDIX B


INSTRUCTIONS TO SUBJECTS 


The fifty-word vocabulary being used with the voice
recognizer in the experiment is attached to these
instructions. You will be required to repeat each word of this
vocabulary three, five and ten times to train the recognizer
to recognize your particular patterns of each word. To
facilitate recognition by the voice recognizer, you should
include in the repetitions as many as possible of the
different ways you might say the word in normal speech; for
example, use different intonations and emphasis, and small
variations in volume.

In order to keep track of the number of times you
say each word when using ten repetitions and to reduce
breath noise, it is best to speak the ten repetitions in
several groups. For example, if the word is zero, it is
better to group them as:

000 - 000 - 0000
or
000 - 000 - 000 - 0
rather than
0000000000.

Please observe the following guidelines while inputting
voice data to the recognizer.


-- Speak each word crisply and quickly but do not 
overpronounce. 


-- Leave a distinct pause (specifically, at least one- 
tenth of a second of silence) between each word so 
that the recognizer can distinguish the end of one 
word from the beginning of the next. Do not leave 
a period of silence within a word or the recognizer 
will mistake it for two separate words. 


-- Avoid breathing into the microphone at the end of
words as this will generate false inputs to the
recognizer.

APPENDIX C


VOCABULARY

WORD #  UTTERANCE               WORD #  UTTERANCE

 0      GRID                    25      FRE
 1      LAUNCH                  26      TIME
 2      COURSE                  27      MAP
 3      GOLF                    28      SCOPE
 4      SPEED                   29      MAINE
 5      MESSAGE                 30      NEUTRAL
 6      ORDERS                  31      Pelee eb) Ese
 7      PLATFORM                32      WHISKEY
 8      SENSOR                  33      LIMA
 9      MISSILE                 34      LOGOUT
10      Sele TE                 35      TRACK UNKNOWN
11      NEGATIVE                36      LONGITUDE
12      SUBMARINE               37      TORPEDO
13      ENEMY                   38      CU oetee R ie ONite
14      EN EGU E                39      ROMEO
15      SAN FRANCISCO           40      FLIGHT CONTROLLER
16      HUMAN FACTORS           41      SEA OF JAPAN
17      UNITED STATES           42      HONOLULU
18      Cleat eOUL CHARLIE      43      ADVANTAGES
19      COLORADO                44      CONTINUOUS
20      CONNECT TO CHARLIE      45      TASK FORCE COMMANDER
21      NORTH ATLANTIC MAP      46      NORTH CAROLINA
22      COMMAND AND CONTROL     47      BEARING AND DISTANCE
23      CONTINUOUS SPEECH       48      PLOT ALL SUBMARINES
24      VOICE TECHNOLOGY        49      UNITED AIR LINES

Out of 240 utterances, each word on the left side was interpreted by the T600 as the word in the row with a number.

The last column, beep, indicates the number of times the T600 rejected the utterance and beeped.

Raw numbers were used because roundoff errors made percentage values insignificant.